Abstract

This  paper  studies  the  implicit  costs  of  synchronization  and  the  advantage  that  may 
be  gained  by  avoiding  synchronization  in  asynchronous  environments.  An  asynchronous 
generalization of the PRAM model called the APRAM model is used and appropriate
complexity  measures  are  defined.  The  advantage  asynchrony  provides  is  illustrated  by 
analyzing  two  algorithms:  a  parallel  summation  algorithm  which  proceeds  along  an 
implicit  complete  binary  tree  and  a  recursive  doubling  algorithm  which  proceeds  along 
a  linked  list. 

1      Introduction 

This  paper  continues  our  study  of  the  effect  of  process  asynchrony  on  parallel  algorithm 
design.  As  is  well  known,  the  main  effort  in  parallel  algorithm  design  has  employed  the 
PRAM  model.  This  model  hides  many  of  the  implementation  issues,  allowing  the  algorithm 
designer  to  focus  first  and  foremost  on  the  structure  of  the  computational  problem  at  hand 
—  synchronization  is  one  of  these  hidden  issues. 

This  paper  is  part  of  a  broader  research  effort  which  has  sought  to  take  into  account 
some of the implementation issues hidden by the PRAM model. Broadly speaking, two major
approaches have been followed. One body of research is concerned with asynchrony and
the resulting non-uniform environment in which processes operate [CZ89,Nis90,MPS89],
[MSP90,AW91]. The other body of research has considered the effect of issues such as latency
to memory, but assumes a uniform environment for the processes [PU87,PY88,AC88],
[ACS89,Gib89].

The  PRAM  is  a  synchronous  model  and  thus  it  strips  away  problems  of  synchronization. 
However,  the  implicit  synchronization  provided  by  the  model  hides  the  synchronization  costs 
from the user. Our work has proceeded in two directions. In [CZ89,CZ91a,CZ91b] we studied
the  cost  of  achieving  synchronization,  which  we  called  the  explicit  costs  of  synchronization. 
This  includes  the  cost  of  executing  extra  code  that  must  be  added  to  the  algorithm  in  order  to 
synchronize.  A  PRAM  algorithm,  for  example,  assumes  that  all  processes  synchronize  before 
every step. By considering the explicit costs of synchronization we can design algorithms
which  require  less  synchronization.  It  is  anticipated  that  such  algorithms  will  perform  better 
in  asynchronous  settings. 

In this paper, we investigate the effects of nonuniformity of the environments on the complexity
of algorithms. Asynchronous environments may well produce such nonuniformity:
different  processes  may  proceed  at  different  speeds  and  even  the  same  process  may  proceed 
at  different  speeds  at  different  times.  Even  if  the  underlying  hardware  is  synchronous  and 
uniform,  the  processes  may  appear  to  be  proceeding  at  different  and  varying  speeds.  This 
may  be  due,  for  instance,  to  uneven  loads  among  the  physical  processors  or  to  processes 
executing  instructions  that  have  different  execution  times. 

To determine the possible gains, note that the most rigid execution of an algorithm is
a lock step execution in which a process is not allowed to execute the ith step before all
processes finish step i - 1. We compare asynchronous executions of an algorithm to the lock
step  execution,  ignoring  the  explicit  costs  of  the  synchronization  needed  to  achieve  lock  step 
execution.  We  call  the  added  complexity  resulting  from  lock  step  execution  the  implicit  costs 
of  synchronization.  We  do  not  define  the  notion  of  explicit  and  implicit  costs  formally,  for 
we  do  not  think  we  have  enough  experience  to  justify  definitive  definitions.  In  general,  we 
do  not  expect  an  algorithm  to  perform  better  when  process  speeds  are  non-uniform.  What 
we  seek  are  algorithms  whose  performance  degrades  only  modestly;  we  say  these  algorithms 
are  resilient  to  changes  in  the  environment.  We  anticipate  resilient  algorithms  will  perform 
significantly better than their synchronous counterparts in asynchronous environments.

[CZ91a]  defines  the  APRAM  model  formally;  its  use  was  first  suggested  in  [KRS88a]. 
The  rounds  complexity  measure  was  used  to  measure  the  explicit  costs  of  synchronization. 
In  this  paper  we  introduce  the  variable  speed  APRAM,  an  APRAM  model  in  which  the 
speed  of  processes  may  vary.  We  define  two  complexity  measures  called  the  unbounded 
delays  measure  and  the  bounded  delays  measure.  They  model  the  non-uniformity  in  the 
environment  by  defining  the  duration  of  an  operation  using  a  random  variable  which  may 
take values larger than 1. The implicit costs of synchronization are inferred by observing
the  effect  of  varying  the  durations  on  the  complexity  of  the  algorithm.  In  addition,  we 
identify  several  characteristics  which  are  desirable  in  order  to  have  robust  measures.  These 
are called monotonicity, scalability and convertibility. We discuss these properties later. For
the moment it suffices to remark that they facilitate algorithm analysis. We show that the
unbounded delays measure satisfies all three properties. This is not true of the bounded
delays measure, which only satisfies monotonicity; even so, we are able to analyze some
interesting algorithms with the latter measure.

We  analyze  two  algorithms  under  the  new  model:  a  parallel  summation  algorithm  along 
an  implicit  tree  and  a  recursive  doubling  algorithm.  Both  are  fundamental  algorithms  and 
appear  as  subroutines  in  many  parallel  algorithms.  We  obtain  two  main  sets  of  results. 
In Section 3, we show that under the unbounded delays measure, in asynchronous environments,
both the APRAM summation algorithm and the APRAM recursive doubling
algorithm  perform  significantly  better  than  their  synchronous  counterparts,  in  a  sense  made 
precise  later.  Next,  in  Section  4,  we  obtain  similar  results  in  the  bounded  delays  measure. 
These  analyses  provide  an  unexpected  separation  result:  in  the  bounded  delays  measure, 
the  recursive  doubling  algorithm  performs  better  than  the  tree  based  summation  algorithm 
in  some  asynchronous  environments.  In  Section  2,  the  APRAM  model  and  the  associated 
complexity  measures  are  described. 

At this point it may be helpful to comment on our contribution. The algorithms we analyze
are simple and known (or only minor modifications of known algorithms). The analyses
are  fairly  straightforward  probabilistic  arguments  (the  main  non-obvious  element  arising  in 
the  analysis  of  the  recursive  doubling  algorithm,  where  our  analysis  is  time-reversed).  The 
main  contribution  of  the  paper  lies  in  the  results  themselves,  which  demonstrate  by  example 
the  advantage  to  be  gained  from  avoiding  the  implicit  costs  of  synchronization,  at  least  in 
certain  asynchronous  environments. 

1.1      Related  Work 

The  work  of  Nishimura  [Nis90]  is  motivated  by  similar  considerations  to  our  work.  She 
analyzes  the  performance  of  a  pointer  jumping  algorithm  in  her  model  and  proves  several 
simulation  results.  More  recently,  Anderson  and  Woll  described  a  wait-free  algorithm  for 
the  Union-Find  problem  [AW91];  the  advantage  of  a  wait-free  algorithm  is  that  no  slow  or 
failing  process  can  prevent  progress  indefinitely. 

Another  approach  to  the  problem  of  asynchrony  is  to  seek  to  efficiently  compile  PRAM 
algorithms  to  operate  in  asynchronous  environments;  this  approach  is  followed  by  Martel, 
Park and Subramonian [MPS89,MSP90]. They give efficient randomized simulations of arbitrary
PRAM algorithms on an asynchronous model, given certain architectural assumptions
(e.g. the availability of a compare&swap instruction). It is not clear whether similar deterministic
compilers exist; in the absence of such compilers, to obtain deterministic algorithms
it appears necessary to design them in an asynchronous environment; this is the focus of
our  work. 

Now,  we  briefly  discuss  some  of  the  work  concerned  with  uniform  environments.  [PU87], 
[PY88,AC88] show that the communication among the processes can be reduced by designing
algorithms which take advantage of temporal locality of access; they assumed that each
(global) memory access has a fixed cost associated with it. In a related paper [ACS89], Aggarwal
et al. argue that typically it takes a substantial time to get the first word from global
memory,  but  after  that,  subsequent  words  can  be  obtained  quite  rapidly.  They  introduce  the 
BPRAM  model  which  allows  block  transfers  from  shared  memory  to  local  memory.  They 
show  that  the  complexity  of  algorithms  can  be  reduced  significantly  by  taking  advantage  of 
spatial  locality  of  reference.  This  approach  implicitly  assumes  that  the  (virtual)  machine 
is  uniform,  comprising  a  collection  of  similar  processes.  In  addition  to  latency  to  memory, 
Gibbons  [Gib89]  studied  the  cost  of  process  synchronization  by  considering  an  asynchronous 
model. Although he assumed that the machine is asynchronous, his analysis in effect assumes
that the processes are roughly similar in speed. More precisely, he required that reads
and writes be separated by global synchronization; this results in a uniform environment
from  the  process  perspective. 

2     The  APRAM  model 

The RAM is a very popular model in sequential computation and algorithm design. Therefore,
it is not surprising that a generalization of the RAM, the PRAM, became a popular
model  in  the  field  of  parallel  computation.  We  use  another  generalization  of  the  RAM 
model,  called  the  APRAM,  suggested  by  Kruskal  et  al.  [KRS88a];  a  formal  definition  of 
the  APRAM  model  is  given  in  [CZ91a].  It  includes  the  standard  PRAM  as  a  submodel. 
The  APRAM  can  be  viewed  as  a  collection  of  processes  which  share  memory.  Each  process 
executes  a  sequence  of  basic  atomic  operations  called  events.  An  event  can  either  perform 
a  computation  locally  (by  changing  the  state  of  the  process)  or  access  the  shared  memory; 
accessing  the  memory  has  a  unit  cost  associated  with  it.  An  APRAM  computation  is  a 
serialization  of  the  events  executed  by  all  the  processes. 

The parallel runtime complexity associated with the PRAM model describes the complexity
of PRAM algorithms as a function of the global clock provided by that model. The
APRAM  model  does  not  have  such  a  clock.  Consequently,  it  is  necessary  to  define  complexity 
measures  to  replace  the  running  time  complexity.  In  [CZ91a]  we  defined  one  such  complexity 
called the rounds complexity: it replaces the global clock used by the PRAM model by a
virtual clock. This approach was introduced in [PF77] and used in [AFL83,LF81,KRS88b]
and is common in the area of distributed computing (see [Awe87], [Awe85,AG87]). Consider
a computation, C. A virtual clock of C is an assignment of unique virtual times to the events
of C; the times assigned are a non-decreasing function of the event number.

The virtual clock is meant to correspond to the "real" time at which the operations occurred
in one possible execution of the algorithm, called a computation. The time difference
between  two  consecutive  events  of  a  process  is  called  the  duration  of  the  later  event.  The 
length  of  a  computation  is  the  time  assigned  to  the  last  event  in  the  computation. 

The  rounds  complexity  of  an  algorithm  is  the  length  of  a  computation  maximized  over 
all  possible  computations;  i.e.,  maximized  over  all  possible  interleavings  of  the  events  of  the 
processes.  In  the  rounds  complexity  measure  of  [CZ91a],  we  assumed  that  the  duration  of 
events  was  at  most  one.  In  effect,  in  the  rounds  complexity  measure  the  slowest  process 
defined  a  unit  of  time  (a  round);  the  complexity  is  expressed  in  rounds.  This  normalizes 
the  complexity  relative  to  the  speed  of  the  slowest  process.  Therefore,  it  is  inadequate  for 
measuring  the  implicit  costs  of  synchronization;  to  capture  these,  we  allow  the  duration  to 
assume  values  larger  than  one. 

There  appear  to  be  several  ways  to  introduce  large  durations.  We  model  the  duration  as  a 
random  variable  as  follows.  We  define  the  variable  speed  APRAM.  Intuitively,  a  computation 
of the variable speed APRAM is divided into rounds. But, unlike the rounds complexity, we
do  not  require  each  process  to  execute  at  least  one  event  in  each  round.  Rather,  for  each 
process  and  for  each  round  the  process  has  a  certain  probability  of  executing  an  event  in  that 
round.  We  then  consider  all  possible  computations  and  associate  with  each  computation  a 
probability. 

More formally, the computations of this model are specified with the help of an adversary
$\mathcal{A}$. First, a pseudo-computation C is created. It is an interleaving of pseudo-events, the
interleaving being specified by $\mathcal{A}$: $\mathcal{A}$ labels each pseudo-event with a distinct time, a real
number. In addition, $\mathcal{A}$ is allowed to eliminate pseudo-events by means of a probabilistic
game; the exact rules of the game depend on the complexity measure being considered. The
pseudo-events remaining after this elimination form the events of the computation. Note
that one adversary may create many computations; each computation C has an associated
probability $Pr_{\mathcal{A}}(C)$, namely the probability that it is created by the elimination game.
The requirement on pseudo-events is that their duration be at most one (the duration of a
pseudo-event e, associated with process id, is the time from the pseudo-event of id preceding
e, if any, to the time of e; if there is no preceding pseudo-event, the duration is just the time
associated with e). The duration of events is defined analogously. The elimination game
may create events with duration greater than 1.

Now, we define two yardsticks. For any number, k, and any adversary, $\mathcal{A}$, the nontermination
probability of algorithm A relative to adversary $\mathcal{A}$, $NT_{\mathcal{A}}(A,k)$, is the probability that
a computation of A has length at least k. The expected length of a computation relative to
adversary $\mathcal{A}$, $E_{\mathcal{A}}(A)$, is then

$$E_{\mathcal{A}}(A) = \sum_{j} NT_{\mathcal{A}}(A,j).$$

Given any algorithm, A, and any number, k, the nontermination probability of A,
$NT(A,k)$, is the maximum over all possible adversaries, $\mathcal{A}$, of the nontermination probability
of A relative to $\mathcal{A}$. In other words:

$$NT(A,k) = \max_{\mathcal{A}} NT_{\mathcal{A}}(A,k).$$

The expected complexity of A is the maximum over all adversaries, $\mathcal{A}$, of the expected
complexity of A relative to $\mathcal{A}$. In other words,

$$E(A) = \max_{\mathcal{A}} E_{\mathcal{A}}(A).$$
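The relation between the expected length and the nontermination probabilities can be checked numerically. The following is a minimal sketch, assuming that computation lengths are positive integers (so the sum runs over j = 1, 2, ...) and using a made-up length distribution for one fixed adversary; the function names and the distribution are illustrative and do not come from the paper.

```python
from fractions import Fraction

def expected_length(dist):
    """Direct expectation of a length distribution {length: probability}."""
    return sum(length * prob for length, prob in dist.items())

def nontermination(dist, k):
    """Probability that the computation has length at least k."""
    return sum(prob for length, prob in dist.items() if length >= k)

def expected_via_tail_sum(dist):
    """Sum of the nontermination probabilities for k = 1 .. max length."""
    return sum(nontermination(dist, k) for k in range(1, max(dist) + 1))

# Hypothetical distribution of integer lengths under one fixed adversary.
dist = {3: Fraction(1, 2), 4: Fraction(1, 4), 6: Fraction(1, 4)}
print(expected_length(dist), expected_via_tail_sum(dist))
```

Both quantities agree because each outcome of length L contributes its probability to exactly L of the tail terms.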

3      The  Unbounded  Delays  Measure 

The  unbounded  delays  measure  has  a  single  parameter,  p.  Given  a  round,  r,  and  a  process, 
id,  the  parameter  p  roughly  corresponds  to  the  probability  that  process  id  does  not  execute 
an  event  in  round  r. 

We  have  the  following  scenario  in  mind.  Typically,  the  number  of  processes  used  by  the 
program  substantially  exceeds  the  number  of  available  processors,  requiring  each  processor 
to  execute  many  processes  concurrently.  The  load  of  the  different  processors  may  not  be 
balanced;  even  if  the  processes  are  initially  distributed  equally  among  all  the  processors, 
in  a  dynamic  computation  it  may  be  difficult  to  maintain  a  balance  among  the  processors. 
Furthermore,  some  processes  may  be  suspended,  waiting  on  an  external  event  such  as  an 
I/O  operation  or  an  interrupt. 

As  described  earlier,  the  computations  are  obtained  with  the  aid  of  an  adversary.  The 
adversary  determines  an  interleaving  of  objects  called  pseudo-events.  A  subsequence  of  the 
pseudo-events  forms  the  computation,  a  sequence  of  events.  The  adversary  associates  a 
distinct time with each pseudo-event. For each process, its sequence of pseudo-events has
strictly increasing times. Pseudo-events are eliminated as follows. The adversary associates
a probability $p_i$ with pseudo-event $e_i$. To determine if $e_i$ is eliminated, the following coin
is tossed: it has probability $p_i$ of showing dummy and probability $1 - p_i$ of showing real. If
the coin shows dummy, $e_i$ is eliminated. The remaining pseudo-events are the events; they
form  the  computation.  Each  successive  event  of  a  process  indicates  when  the  next  step  of 
its  program  is  to  be  performed. 
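The elimination game for a single process can be sketched as follows. This is an illustration only: the helper names `eliminate` and `durations` are hypothetical, the pseudo-event times and probabilities would be chosen by the adversary, and the coin tosses are assumed independent.

```python
import random

def eliminate(pseudo_times, probs, rng):
    """Toss one coin per pseudo-event; a pseudo-event with elimination
    probability p is dropped ('dummy') when the draw falls below p."""
    return [t for t, p in zip(pseudo_times, probs) if rng.random() >= p]

def durations(times):
    """Duration of each event: time since the previous event of the same
    process, or the event's own time if it is the first one."""
    return [t - prev for prev, t in zip([0.0] + times[:-1], times)]

rng = random.Random(0)
pseudo = [0.5, 1.5, 2.5, 3.5]              # pseudo-event durations are at most 1
events = eliminate(pseudo, [0.5] * 4, rng)  # surviving events of this process
print(events, durations(events))
```

Although every pseudo-event has duration at most one, a surviving event that follows an eliminated pseudo-event has a duration greater than one, which is how large delays enter the model.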

For each integer, r, round r of the computation is the subsequence containing all the
events whose time field t satisfies $\lfloor t \rfloor = r$.

Let C be any computation obtained as above. Define $Pr_{\mathcal{A}}(C)$, the probability of obtaining
C given the adversary $\mathcal{A}$, to be the combined probability of all the choices which led to the
computation. The adversary is called conforming if for each round, r, and each process,
id, the probability that round r does not contain an event of process id is at most p. We
consider  only  conforming  adversaries. 

Notation. For any algorithm, A, let $NT_p(A,k)$ (resp. $E_p(A)$) be the nontermination probability
(resp. the expected complexity) of A under the unbounded delays measure with
parameter p.

3.1      Superevents 

It  is  often  more  convenient  to  describe  algorithms  in  terms  of  higher  level  constructs  called 
superevents. For any algorithm, A, and any number, r, let $A^r$ denote algorithm A specified
in terms of superevents, where each superevent comprises at most r events. A superevent,
e, is said to have executed within superround t, for some superround number t, if all the
events comprising the superevent are within the superround.

As before, we obtain a computation of $A^r$ with the aid of an adversary. Although
the algorithm is specified in terms of superevents, a computation is defined as a sequence
of events. The usefulness of the definitions will become clear when we analyze the two
algorithms in later sections. The adversaries for A and for $A^r$ differ only in the definition of
conformity. An adversary for A is a conforming adversary if the probability that a round
does not contain at least one event is at most p. An adversary for $A^r$ is said to be conforming
if the probability that a round (or in this case a superround) does not contain a superevent
is at most p.

As each superevent comprises at most r events, a round which contains at least 2r - 1
events of a given process contains at least one superevent of that process. An adversary of
$A^r$ is conforming if for every round, t, and every process, id, the probability that round t
contains at least 2r - 1 events of process id is at least 1 - p.
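The counting claim (2r - 1 consecutive events of a process always include a complete superevent when each superevent has at most r events) can be checked exhaustively for small cases. The sketch below is an illustration under the assumption that a process's events are simply the concatenation of its superevents; the function names are hypothetical.

```python
from itertools import product

def window_contains_superevent(sizes, start, width):
    """True if events [start, start + width) include every event of at
    least one superevent, where sizes lists consecutive superevent sizes."""
    pos = 0
    for size in sizes:
        if pos >= start and pos + size <= start + width:
            return True
        pos += size
    return False

def check(r, count):
    """Test every sequence of `count` superevents with sizes in 1..r and
    every window of 2r - 1 consecutive events inside that sequence."""
    width = 2 * r - 1
    for sizes in product(range(1, r + 1), repeat=count):
        for start in range(sum(sizes) - width + 1):
            if not window_contains_superevent(sizes, start, width):
                return False
    return True

print(check(2, 4), check(3, 4))
```

The bound is tight in this sketch: with r = 2 and two superevents of size 2, the window of 2r - 2 = 2 events starting at the second event straddles both superevents and contains neither.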


3.2      Robustness  Criteria 

In  order  to  obtain  a  robust  measure  we  require  it  to  observe  three  properties.  The  first 
is  monotonicity:  intuitively,  we  require  that  when  the  speed  of  the  processes  is  increased 
the  complexity  does  not  increase.  Translated  to  our  measure,  we  require  that  when  the 
probability  that  a  round  does  not  contain  an  event  of  a  given  process  is  decreased,  the 
complexity does not increase.

The second property is scalability. Scalability refers to the property that the complexity
of the algorithm analyzed in terms of superevents is within a constant factor of the
complexity  of  the  algorithm  analyzed  in  terms  of  events. 

The  third  property  is  convertibility.  It  defines  the  relationship  between  results  obtained 
under  the  unbounded  delays  measure  with  different  parameters. 

In  the  next  three  lemmas  we  show  that  the  unbounded  delays  measure  observes  all  three 
properties. 

Lemma 3.1 (Monotonicity) For any algorithm, A, and any two probabilities, p and q,
if $p \le q$ then

1. $E_p(A) \le E_q(A)$.

2. For any number, t, $NT_p(A,t) \le NT_q(A,t)$.

Proof. Any conforming adversary for algorithm A under the unbounded delays measure
with parameter p is also a conforming adversary for algorithm A under the unbounded
delays measure with parameter q. □

Lemma 3.2 (Convertibility) For any algorithm, A, for any two probabilities, p and q,
and for any number, c, if $q = p^c$ then

1. $E_p(A) \le c \cdot E_q(A)$.

2. For any number, t, $NT_p(A,tc) \le NT_q(A,t)$.

Proof. Let $\mathcal{A}$ be any conforming adversary for A under the unbounded delays measure
with parameter p. Define an adversary, $\mathcal{A}'$, as follows: $\mathcal{A}'$ defines the same interleaving as
$\mathcal{A}$; $\mathcal{A}'$ associates the same probability as $\mathcal{A}$ to each pseudo-event; but where $\mathcal{A}$ associates
time t(e) to pseudo-event e, $\mathcal{A}'$ associates time $\lfloor t(e)/c \rfloor$.

For any process identifier, id, and for any round, r, let $Pr_{\mathcal{A}}(id,r)$ denote the probability
that process id will not have an event in round r given adversary $\mathcal{A}$. Note that $Pr_{\mathcal{A}}(id,r) \le p$.
Then

$$Pr_{\mathcal{A}'}(id,r) = \prod_{i=0}^{c-1} Pr_{\mathcal{A}}(id, rc+i) \le \prod_{i=0}^{c-1} p = p^c = q.$$

Therefore, $\mathcal{A}'$ is a conforming adversary for algorithm A under the unbounded delays measure
with parameter q. For adversary $\mathcal{A}'$, algorithm A has expected rounds complexity
$\lceil E_p/c \rceil$; the first result follows. Furthermore, for any number, t, if the probability that A
does not terminate in t rounds under adversary $\mathcal{A}$ is $p_1$, then the probability that A does not
terminate in $\lfloor t/c \rfloor$ rounds under adversary $\mathcal{A}'$ is at least $p_1$; the second result follows. □
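The key step of the proof, that merging c consecutive rounds into one turns a per-round miss probability of at most p into at most p^c, can be illustrated with exact arithmetic. This is a sketch under the assumption that the per-round events are independent; the function name and the sample probabilities are hypothetical.

```python
from fractions import Fraction
from math import prod

def merged_miss_probability(miss_probs, c, r):
    """Probability that a process has no event in merged round r, that is,
    in none of the original rounds r*c, ..., r*c + c - 1.
    miss_probs[t] is the (assumed independent) miss probability of round t."""
    return prod(miss_probs[r * c + i] for i in range(c))

p = Fraction(1, 2)
# Hypothetical per-round miss probabilities, each at most p.
miss = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 2), Fraction(1, 4)]
for r in range(2):
    print(r, merged_miss_probability(miss, 2, r), p ** 2)
```

Each merged round misses with probability at most p^2 = 1/4 here, so the rescaled adversary conforms to parameter q = p^c.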

Corollary 3.1 For any algorithm, A, and any integer, $l \ge 1$,

1. $E_{p^l}(A) \le E_p(A) \le l \cdot E_{p^l}(A)$.

2. For any $t \ge 1$, $NT_{p^l}(A, l \cdot t) \le NT_p(A, l \cdot t) \le NT_{p^l}(A,t)$.

Proof. The inequalities on the left follow from Lemma 3.1, the ones on the right from
Lemma 3.2. □

Recall that $A^r$ denotes algorithm A specified in terms of superevents, where each superevent
comprises at most r events.

Lemma 3.3 (Scalability) For any algorithm, A, for any number, r, and any two probabilities,
p and q, if $q = 1 - (1-p)^{2r-1}$ then

1. $E_p(A) \le (2r-1) E_q(A^r)$.

2. For any number $t \ge 1$, $NT_p(A,(2r-1)t) \le NT_q(A^r,t)$.

Proof. Let $\mathcal{A}$ be any conforming adversary for A under the unbounded delays measure
with parameter p. Define an adversary, $\mathcal{A}'$, as follows: $\mathcal{A}'$ defines the same interleaving as
$\mathcal{A}$; $\mathcal{A}'$ associates the same probability as $\mathcal{A}$ to each pseudo-event; but where $\mathcal{A}$ associates
time t(e) to pseudo-event e, $\mathcal{A}'$ associates time $\lfloor t(e)/(2r-1) \rfloor$.

For any number, s, consider the segment of 2r - 1 rounds of $\mathcal{A}$ starting from round
s(2r - 1) and ending with round (s + 1)(2r - 1) - 1. The probability that this segment
contains at least one event from each of the 2r - 1 rounds is at least $(1-p)^{2r-1}$. Therefore,
the probability that round s of $\mathcal{A}'$ contains at least 2r - 1 events is at least $(1-p)^{2r-1} =
1 - q$. Thus, $\mathcal{A}'$ is a conforming adversary of $A^r$ under the unbounded delays measure with
parameter q.

For adversary $\mathcal{A}'$, algorithm A has expected rounds complexity $\lceil E_p/(2r-1) \rceil$; the first
result follows. Furthermore, for any number, t, if the probability that A does not terminate
in t rounds under adversary $\mathcal{A}$ is $p_1$, then the probability that $A^r$ does not terminate in
$\lfloor t/(2r-1) \rfloor$ rounds under adversary $\mathcal{A}'$ is at least $p_1$; the second result follows. □


Corollary 3.2 For any algorithm, A, any probability, 0 < p < 1, and any integer, $r \ge 1$,
let $q = 1 - (1-p)^{2r-1}$. Then,

1. $E_p(A^r) \le E_p(A) \le (2r-1) E_q(A^r)$.

2. For any $t \ge 1$, $NT_p(A^r,(2r-1)t) \le NT_p(A,(2r-1)t) \le NT_q(A^r,t)$.

Proof. The inequalities on the left are trivial lower bounds; those on the right follow from
Lemma 3.3. □

3.3      The  Synchronous  Variation 

Recall  that  the  purpose  of  the  probabilistic  complexity  measure  is  to  assess  the  reduction 
in  the  implicit  costs  for  synchronization  of  the  asynchronous  algorithms  compared  with 
their  synchronous  counterparts.  To  this  end,  we  introduce  a  modified  synchronous  model, 
whose  performance  is  compared  to  that  of  the  asynchronous  model;  we  call  this  model  the 
synchronous  variation.  In  the  synchronous  variation,  in  addition  to  being  synchronous,  a 
computation  is  subject  to  the  same  delay  pattern  as  an  asynchronous  computation.  However, 
in the modified model no cost is charged for explicit synchronization, and thus, the difference
in  complexity  between  an  algorithm  in  the  synchronous  variation  and  in  the  corresponding 
asynchronous  variation  provides  some  indication  of  the  reduction  in  the  implicit  costs  of 
synchronization  yielded  by  the  asynchronous  model. 

More formally, consider an algorithm, A, and consider any computation, $C_A$, of A. A
synchronous  computation  of  A  under  the  unbounded  delays  measure  with  parameter  p  is 
obtained  as  follows.  A  pseudo-computation  is  created  with  exactly  one  pseudo-event  per 
process for each round. Then the adversary attaches a probability p(e) to each pseudo-event
e. As before, p(e) is the probability that e is eliminated; the requirement is that $p(e) \le p$.
Again,  the  surviving  events  are  the  events  of  the  computation.  Now,  an  event  may  be 
a  busy-wait,  required  to  await  the  (implicit)  synchronization  which  ends  each  step  of  the 
synchronous  computation. 

Consider an algorithm, A, with N processes. In each time step of the synchronous computation,
each process must execute exactly one step. Therefore, the expected number of
rounds comprising a single step of the computation is the expected maximum of N independent
random variables. For the unbounded delays measure with parameter p, the expected duration
of each step is $\log_{1/p} N = \log N / \log(1/p)$. For any time step, t, let $N_t$ be the number of
processes at that time step. Then, the expected number of rounds required for time step, t,
is $\log N_t / \log(1/p)$.
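The per-step estimate can be compared with an exact computation in a simple special case. The sketch below assumes that every process independently fails to execute in a round with probability exactly p, so the number of rounds a step takes is the maximum of N independent geometric waiting times; the function name is hypothetical, and the paper's estimate drops additive constants that the exact value retains.

```python
import math

def expected_step_rounds(n_procs, p, tol=1e-12):
    """E[max of n_procs geometric waits] = sum over k >= 0 of
    P(some process has not executed after k rounds)
    = sum over k >= 0 of 1 - (1 - p**k)**n_procs."""
    total, k = 0.0, 0
    while True:
        term = 1.0 - (1.0 - p ** k) ** n_procs
        if term < tol:          # remaining tail is negligible
            return total
        total += term
        k += 1

for n in (1, 16, 1024):
    print(n, expected_step_rounds(n, 0.5), math.log2(n))
```

For p = 1/2 the exact expectation stays within a small additive constant of log N, which is consistent with the logarithmic growth used in the analysis.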

Now,  we  give  two  simulations  for  the  synchronous  variation  that  allow  us  to  ignore 
constants  in  the  probabilities  and  to  analyze  algorithms  in  terms  of  superevents  rather  than 
events.


Lemma 3.4 For any algorithm, A, any probability, p, and any number, c,

$$E_{p^c}(A) = \sum_t \frac{\log N_t}{\log(1/p^c)} = \frac{1}{c} \sum_t \frac{\log N_t}{\log(1/p)} = \frac{1}{c} \cdot E_p(A).$$

Let $A^r$ be the algorithm A defined in terms of superevents, each superevent comprising
exactly r events. Then

Lemma 3.5 For any algorithm, A, and any probability, p, $E_p(A) = r \cdot E_p(A^r)$.

Proof. Follows from the additivity of expectations. □

3.4      Summation  Algorithm 

The  summation  algorithm  iterates  the  following  superevent  comprising  a  condition  test  plus 
a  possible  execution  of  the  if  statement. 

Algorithm  for  process  i: 

while (V_i is not valid) do
    if R_i and L_i are valid then
        set V_i := L_i + R_i
        set the tag of V_i to valid
    end if
end while
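The loop above can be simulated under one simple conforming adversary. The following is a sketch, not the paper's analysis: it assumes n is a power of two, stores the implicit tree in heap order, lets each unfinished process execute its superevent in a round independently with probability 1 - p, and processes nodes root first so that a value climbs at most one level per round. The function name and these choices are assumptions made for illustration.

```python
import random

def async_tree_sum(values, p, rng):
    """Simulate the summation loop; return (root value, rounds used)."""
    n = len(values)                     # assumed to be a power of two, n >= 2
    val = [None] * n + list(values)     # node i has children 2i and 2i + 1
    rounds = 0
    while val[1] is None:               # the root is not yet valid
        rounds += 1
        for i in range(1, n):           # root first: pessimistic interleaving
            if val[i] is None and rng.random() >= p:   # process i runs now
                left, right = val[2 * i], val[2 * i + 1]
                if left is not None and right is not None:
                    val[i] = left + right              # validating event
                # otherwise this is a waiting event and nothing changes
    return val[1], rounds

print(async_tree_sum(list(range(8)), 0.0, random.Random(1)))
print(async_tree_sum(list(range(8)), 0.5, random.Random(1)))
```

With p = 0 the simulation takes exactly log n rounds; with p > 0 no process waits for a whole level to finish, only for its own two children.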

The summation algorithm comprises log n levels of superevents; henceforth we call superevents
events.

Under the synchronous variation, none of the processes at level i + 1 can proceed before
all the processes at level i finish their computation. Furthermore, there are $n/2^i$ processes
at level i (the leaves are considered to be at level 0). Let i be given and let P be a process
associated with an internal node at level i. Let $N_i$ be the expected number of rounds it takes
to compute the ith level of the computation tree. Since all the probabilities are assumed to
be independent, if there are m nodes at level i, the expected number of rounds required by
the level is $\Theta(\log m / \log(1/p))$. The expected number of rounds for the entire computation
is the sum over all i of the expected number of rounds for level i; this is $\Theta(\log^2 n / \log(1/p))$.
When p = 1/2, for instance, this becomes $\Theta(\log^2 n)$.
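The level-by-level sum can be evaluated directly. This sketch takes the per-level estimate log m / log(1/p) at face value (constants dropped, logarithms base 2) and assumes n is a power of two with n/2^i nodes at level i; the function name is hypothetical.

```python
import math

def synchronous_rounds_estimate(n, p):
    """Sum over internal levels i = 1 .. log n of log(n / 2**i) / log(1/p)."""
    levels = int(math.log2(n))
    per_level = (math.log2(n >> i) for i in range(1, levels + 1))
    return sum(per_level) / math.log2(1 / p)

for n in (16, 1024):
    print(n, synchronous_rounds_estimate(n, 0.5), math.log2(n) ** 2)
```

The sum equals log n (log n - 1) / 2 divided by log(1/p), which grows quadratically in log n as stated.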

Let  us  consider  the  asynchronous  case.  The  events  of  the  computation  are  of  two  types. 
If  at  least  one  of  the  children  of  the  node  associated  with  the  event  is  not  valid,  the  else 
clause of the if statement is executed and the event is called a waiting event. Otherwise, an
event  validates  a  node  and  is  called  the  validating  event. 

In  order  to  simplify  the  analysis  we  consider  an  auxiliary  graph  comprising  n  paths, 
one  path  corresponding  to  each  path  in  the  implicit  tree  from  a  leaf  to  the  root.  Given  a 
computation  on  the  tree,  we  define  an  event  to  occur  at  a  node  v  on  the  path  if  an  event 
occurs at the corresponding node in the tree (in general, an event will occur at nodes on
several  paths  simultaneously).  The  event  on  the  path  is  validating  if  its  only  child  is  valid 
(initially  the  leaves  are  the  only  valid  nodes).  An  easy  induction  shows  that  the  computation 
at  a  node  v  in  the  tree  is  validated  exactly  when  all  the  corresponding  nodes  on  the  paths 
are  validated.  Now,  we  can  overestimate  the  non-termination  probability  for  the  algorithm 
by  multiplying  by  n  the  non-termination  probability  for  a  single  path.  (A  very  similar 
argument  is  due  to  Luby  [Lub88].) 

So the probability that some path requires at least $c \cdot k$ rounds to be completely validated,
assuming c > 1 and $k \ge \log n$, is bounded by:

$$n \binom{ck}{\log n} p^{ck - \log n} \le 1/2^k,$$

assuming that $p \le 2^{-(c+2)/(c-1)}$.


Taking $c = 4$ we have shown:

Theorem 3.1 For any $j > \log n$, if $p < \frac{1}{4}$, the summation algorithm terminates in at most
$4j$ rounds with probability at least $1 - \frac{1}{2^{j}}$.
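A small simulation makes the asynchronous behaviour concrete. This is a sketch under stated assumptions, not the paper's model in full: every internal node has its own process, a process misses a round with probability exactly $p$ independently of everything else, and all events of a round read the state left by the previous round. The function name and the heap layout are illustrative choices.

```python
import random

def async_summation_rounds(n: int, p: float, rng: random.Random) -> int:
    """Rounds until the root of a complete binary tree with n leaves is valid.

    n is assumed to be a power of two. Heap layout: node 1 is the root and
    nodes n .. 2n-1 are the leaves, which start out valid.
    """
    valid = [False] * (2 * n)
    for leaf in range(n, 2 * n):
        valid[leaf] = True
    rounds = 0
    while not valid[1]:
        rounds += 1
        snapshot = valid[:]            # state visible to this round's events
        for v in range(1, n):          # one process per internal node
            if snapshot[v]:
                continue               # already validated
            if rng.random() < p:
                continue               # process is delayed: no event this round
            if snapshot[2 * v] and snapshot[2 * v + 1]:
                valid[v] = True        # validating event
            # otherwise: waiting event, nothing changes
    return rounds
```

With `p = 0.0` the function returns exactly $\log n$; averaging many runs for $p < \frac{1}{4}$ gives values that stay within a small constant factor of $\log n$, in line with the theorem.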

This can be generalized for larger values of $p$ using Lemma 3.2 yielding:

Corollary 3.3 For any $j > \log n$, and any $l > 1$, if $p < (\frac{1}{4})^{1/l}$, the summation algorithm
terminates in at most $4jl$ rounds with probability at least $1 - \frac{1}{2^{j}}$.

For example, when $p = \frac{1}{2}$, the algorithm terminates in at most $8\log n$ rounds with
probability at least $1 - \frac{1}{n}$.

Corollary 3.4 For any number, $l > 1$, if $p < (\frac{1}{4})^{1/l}$, the expected number of rounds the
summation algorithm executes is at most $4l \log n + o(1)$.

3.5      Recursive Doubling

Unlike  the  summation  algorithm,  which  was  performed  along  an  implicit  binary  tree,  in  the 
recursive  doubling  algorithm  each  process  proceeds  independently  of  the  other  processes. 
Without  loss  of  generality  we  assume  that  the  input  comprises  a  single  linked  list  of  elements. 

Under the synchronous variation the recursive doubling algorithm behaves very similarly
to the summation algorithm. It proceeds in $\lceil \log n \rceil$ levels. A computation associated with
level $i$ does not proceed before all the computations of level $i - 1$ have terminated. The
expected complexity of the algorithm is then $\Theta(\log^2 n/\log(1/p))$. For $p = \frac{1}{2}$, for instance,
this is $\Theta(\log^2 n)$.

We  turn  to  the  asynchronous  case.  Consider  any  probabilistic  computation,  C,  of  the 
recursive  doubling  algorithm.  We  show  that,  with  high  probability,  the  algorithm  terminates 
in  O(logn)  rounds.  Recall  that  each  process  is  associated  with  one  element  of  the  list.  For 
any element, $v$, any round, $r$, and any event, $e$, we say that the event $e$ is seen by $v$ at round
$r$ if the process associated with $v$ executed event $e$ at round $r$.

The algorithm can be viewed as running on an infinite list consisting of the first $n - 1$
elements followed by the head of the list repeated indefinitely. Recall that for any pair of
elements, $u$ and $v$, $D(u, v)$ is the distance from $u$ to $v$ and is defined to be the number of links
between $u$ and $v$ in the input list. Let $V$ be the set comprising the first $n - 1$ elements of the
list and let $p_k(v)$ be the successor of $v$ after the $k$th round. The algorithm terminates when
the first $n - 1$ elements of the list point to elements at distance at least $n - 1$. Therefore,
we wish to compute the probability

$$ \Pr\Bigl[\, \min_{v \in V} \{ D(v, p_k(v)) \} < n - 1 \,\Bigr]. $$


Note that regardless of the events seen by $v$ at round $k + 1$, $D(v, p_{k+1}(v)) \ge D(v, p_k(v))$.

Recall that each iteration of the algorithm comprises two operations: a read operation
in which a node, $v$, reads the current parent of its parent, and an update operation in which
the node replaces its current parent by the new value it read.
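The read and update operations can be sketched as a round-based simulation. The following is illustrative only: it assumes each process misses a round with probability exactly $p$, and that every read in a round sees the pointers left by the previous round (the paper allows the read to happen at any time since the process's previous event). The name `async_doubling_rounds` is hypothetical.

```python
import random

def async_doubling_rounds(n: int, p: float, rng: random.Random) -> int:
    """Rounds until every element of an n-element list points to the last one.

    Element i initially points to i + 1; the last element points to itself,
    mirroring the view of the list as extended by a repeated final element.
    """
    succ = [min(i + 1, n - 1) for i in range(n)]
    rounds = 0
    while any(s != n - 1 for s in succ):
        rounds += 1
        snapshot = succ[:]                   # pointers visible to this round's reads
        for v in range(n):
            if rng.random() < p:
                continue                     # process of v is delayed this round
            succ[v] = snapshot[snapshot[v]]  # read parent's parent, then update
    return rounds
```

With `p = 0.0` the distance covered by each pointer doubles every round, so an 8-element list finishes in 3 rounds; with delays some pointers jump over stale neighbours, which is the situation the execution tree below is designed to account for.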

We  analyze  the  algorithm  with  the  aid  of  a  computation  DAG  defined  as  follows.  The 
processes  (or  list  elements)  are  listed  along  the  x  axis  and  the  virtual  time  along  the  y 
axis.  There  is  a  vertical  line  going  through  each  element  point  on  the  x  axis;  this  represents 
the  element's  history;  the  vertical  lines  are  assumed  to  be  directed  towards  the  x  axis  (i.e., 
towards  the  past). 

For any event, $e$, let $t(e)$ be the virtual time assigned to $e$ and let $P(e)$ be the process
which executed $e$. $t(e)$ represents the virtual time at which superevent $e$ was completed.
Let $t^-(e)$ be the (virtual) time when $e$ executed the read operation. This time is not given
explicitly by the analysis but, unless $e$ is the first event of the process, it is some time
between $t(e)$ and the end of the previous event executed by $P(e)$. Let $e$ be any event, let $v$
be the element associated with $e$, and let $w$ be the parent of $v$ at time $t^-(e)$. Add a directed
edge to the graph from the point $(v, t(e))$ to the point $(w, t^-(e))$.

For any element, $x$, define the execution tree of $C$ relative to $x$ as follows. The root of
the tree is the point $(x, |C|)$, where $|C|$ represents the length of the computation. For any
point, $(e, r)$, the children of $(e, r)$ are those points at level $r - 1$ which are on a path from
$(e, r)$. In other words, for any integer, $r$, if $r = 0$ then $(e, r)$ is a leaf; otherwise, the children
of $(e, r)$ are $\{\, (j, r - 1) \mid (j, r - 1) \text{ is on a path from } (e, r) \,\}$.

The execution tree has the following structure: Let $r$ be any round number. A point
$(i, r)$ has only one child if and only if process $i$ did not execute an event at round $r$; this
can happen with probability at most $p$. Otherwise, $(i, r)$ has at least two children. For any
$i$ and $r$, let $X_i^r$ be the number of leaf descendants of $(i, r)$ in the computation tree. Then

Lemma 3.6 For any $i$ and $l$, $X_i^l \le D(i, p_l(i))$.

Proof. Follows by induction from the definition. $\Box$

The analysis is reduced to finding a number $k$ and showing that, with high probability,
for any element, $x$, $(x, k)$ has at least $n$ leaf descendants in the computation tree relative to
$x$. In fact,

Theorem 3.2 Given $p < \frac{1}{9}$, and for any element, $x$, if $|C| = t > 2\lceil \log_{3/2} n \rceil$ then
$\Pr[X_x^t < n] < (\frac{2}{3})^t$.

Theorem 3.2 states that, for any given element, $x$, after $t > 2\log_{3/2} n$ rounds the probability
that $x$ does not point to the end of the list is at most $(\frac{2}{3})^t$. The array has $n$ elements,
therefore, the probability that after $t$ rounds there is an element, $x$, for which $p(x)$ is not
the head of the list is at most $n(\frac{2}{3})^t$.

Theorem 3.3 Given $p < \frac{1}{9}$ and for any computation of the recursive doubling algorithm,
after $t > 2\lceil \log_{3/2} n \rceil$ rounds every element in the list points to the end of the list with
probability at least $1 - n(\frac{2}{3})^t$.

This can be generalized for larger values of $p$ using Lemma 3.2 yielding:

Corollary 3.5 For any $l > 1$, if $p < (\frac{1}{9})^{1/l}$, after $t > 2l\lceil \log_{3/2} n \rceil$ rounds every element in
the list points to the end of the list with probability at least $1 - n(\frac{2}{3})^{t/l}$.

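And finally,

Corollary 3.6 For any $l > 1$, if $p < (\frac{1}{9})^{1/l}$, the expected length of a computation of the
recursive doubling algorithm is at most $2l \log_{3/2} n + o(1)$.

Proof. The expected length of a computation of the recursive doubling algorithm is:

$$\begin{aligned}
\sum_{j} \mathrm{NT}_p(A, j) &= \sum_{j < 2l \log_{3/2} n} \mathrm{NT}_p(A, j) + \sum_{j \ge 2l \log_{3/2} n} \mathrm{NT}_p(A, j) \\
&\le 2l \log_{3/2} n + \sum_{j \ge 2l \log_{3/2} n} n \left(\tfrac{2}{3}\right)^{j/l} \\
&= 2l \log_{3/2} n + o(1).
\end{aligned}$$
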

Proof of Theorem 3.2. Let $C$ be any computation and let $x$ be any element in the list.
Consider the computation tree (of $C$) relative to $x$. For any $i$ and $l$, let $Y(i, l)$ denote the
number of children of $(i, l)$ in the computation tree. The probability that a process does
not execute an event in a given round is independent of all other events (by the measure
assumption). Therefore, the variables $Y(\cdot, \cdot)$, treated as random variables, are independent
and identically distributed: For any $i$ and $l$, $Y(i, l) = 1$ with probability $p$ and $Y(i, l) \ge 2$
with probability $1 - p$.

Consider the tree from the root down. Let $\beta$, $0 < \beta < 1$, be a constant and for any
number, $t$, let $C^t$ denote the elements at level $t$ of the tree. We call level $t$ of the tree
productive if at least $\beta|C^t|$ of its elements have two children. Note that if $C^t$ is productive
then $|C^{t-1}| \ge (1 + \beta)|C^t|$. There is one element at level $k$; therefore, if at least $\lceil \log_{(1+\beta)} n \rceil$
levels of the tree are productive then the tree has at least $n$ leaves.

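Productive Probability. Recall $p$, the probability that a node has only one child. Let $l$
be any number. Define $q_l$ to be the probability that level $t$ is not productive given that it
has $l$ nodes. Then

$$ q_l = \Pr\bigl[\, |C^{t-1}| < (1+\beta) \cdot l \;\big|\; |C^t| = l \,\bigr] \le \binom{l}{(1-\beta)l} p^{(1-\beta)l} \le 2^l p^{(1-\beta)l} = \bigl(2p^{1-\beta}\bigr)^l. $$

If $p < (\frac{1}{2})^{\frac{1}{1-\beta}}$, the probability of a level being unproductive shrinks exponentially with the
number of elements in the level. For $\beta = \frac{1}{2}$ and $p < \frac{1}{9}$, $q_1 = p$, $q_2 = p^2 < p$, $q_3 \le 3p^2 < p$,
$q_4 \le 4p^3 < p$ and $q_5 \le 10p^3 < p$. Setting $\beta = \frac{1}{2}$ and restricting $p < \frac{1}{9} < (\frac{1}{2})^2 = \frac{1}{4}$, the
probability, $q$, that a level is not productive is then

$$ q \le \max_l \{q_l\} \le \max\Bigl\{ p, \max_{l \ge 6} \bigl\{ (2p^{1-\beta})^l \bigr\} \Bigr\} \le \max\bigl\{ p, (2p^{1/2})^6 \bigr\} = p. $$

Now, assume that the root is at level $t$ and that $t > 2\log_{3/2} n$. Let $Q(t, l)$ denote the
probability that at least $l$ levels of a computation tree with $t$ levels are not productive.

$$\begin{aligned}
\Pr\bigl[\, |C^0| < n \mid t > 2\log_{3/2} n \,\bigr] &\le Q\bigl(t,\, t - \log_{3/2} n \mid t > 2\log_{3/2} n\bigr) \\
&\le \binom{t}{t - \log_{3/2} n} q^{\,t - \log_{3/2} n} \le 2^t q^{t/2} = (4q)^{t/2}.
\end{aligned}$$

On restricting $p < \frac{1}{9}$, we get that $4q < 4p < \frac{4}{9}$ and thus

$$ \Pr\bigl[\, |C^0| < n \mid t > 2\log_{3/2} n \,\bigr] < \left(\tfrac{2}{3}\right)^t. $$

$\Box$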
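The small cases $q_1, \ldots, q_5$ used in the proof can be checked by direct computation. The sketch below assumes, for illustration, that each node independently has exactly one child with probability exactly $p$ (the proof only needs "at most $p$") and takes $\beta = \frac{1}{2}$ by default; the function name is hypothetical.

```python
from math import ceil, comb

def unproductive_probability(l: int, p: float, beta: float = 0.5) -> float:
    """Pr[a level with l nodes is not productive].

    A level is productive when at least beta * l of its nodes have two
    (or more) children. Assumes each node independently has exactly one
    child with probability p.
    """
    need = ceil(beta * l)  # fewest two-child nodes that make the level productive
    # Sum over j = number of two-child nodes, for j below the threshold.
    return sum(comb(l, j) * (1 - p) ** j * p ** (l - j) for j in range(need))
```

For example, `unproductive_probability(1, 0.1)` is $p$ and `unproductive_probability(2, 0.1)` is $p^2$, and for $p = 0.1$ the value stays below $p$ for every level size tried, which is the property the proof uses.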


4      The  Bounded  Delays  Measure 

The bounded delays measure has two parameters: $p$, the probability of being delayed, and
$k$, the length of the longest duration. We have the following scenario in mind. In a multiuser
parallel environment the operating system may need to preempt a process for a relatively
long time but not long enough to justify reallocating the work it was performing. In a well
designed system we expect such interference to be infrequent. We capture such behavior in
the bounded delays model with parameters $p$ and $k$: the probability that the duration is
larger than 1 is at most $p$ and the duration of an event is bounded by $k$.

As  for  the  unbounded  delays  measure,  a  probabilistic  computation  for  the  bounded  delays 
measure  is  obtained  with  the  aid  of  an  adversary.  The  adversary  creates  an  interleaving  of 
the  pseudo-events  as  before,  by  attaching  a  distinct  real  time  to  each  pseudo-event;  also,  it 
associates  a  probability  of  elimination  with  each  pseudo-event,  as  well  as  an  integer  future 
delay. The future delay is bounded by $k$.

Given an adversary, $A$, in order to obtain a computation we start with an infinite
sequence, $S = e_0, e_1, \ldots$, of pseudo-events, where each process has at least one pseudo-event
in each round (a unit of time). The computation is obtained in two steps: For each process,
the first step (Init) defines the delay before the first event of the process. The second step
(Iter) defines the durations of all other events. The need for the first step will become clear
in the lower bound discussion of Section 4.1 (see Lemma 4.1).

Init: Let id be any process. Let $e_0$ be the first pseudo-event for process id. Let $p_0$ and $d_0$
be respectively the probability and delay associated with $e_0$ by the adversary. Toss
a coin which has probability $p_I = \frac{p_0(d_0 - 1)}{1 + p_0(d_0 - 1)}$ of showing delayed and a probability of
$1 - p_I = \frac{1}{1 + p_0(d_0 - 1)}$ of showing normal. If the coin shows delayed, pick a number, $m$,
between 1 and $d_0 - 1$ uniformly at random and eliminate the first $m$ pseudo-events of
process id.

Iter: Let $e_i$ denote the $i$th pseudo-event of process id. Let $p_i$ and $d_i$ be respectively the
probability and delay associated with $e_i$ by the adversary. If $e_i$ has not been eliminated
(it can be eliminated either in an application of step Init or in an earlier application of
this step) toss a coin which has probability $p_i$ of showing delayed and probability $1 - p_i$
of showing normal. In either case pseudo-event $i$ remains in the sequence. However, if
the coin shows delayed, the $d_i - 1$ pseudo-events of process id immediately following
pseudo-event $i$ are eliminated. This corresponds to a long duration for the event of
process id following event $e_i$.

The  remaining  pseudo-events  are  events  and  comprise  the  computation. 
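The elimination game for a single process can be sketched as follows. This is an illustration under assumptions: it uses the uniform adversary (one pseudo-event per round, the same probability $p$ and delay $k$ for every pseudo-event), and the function name `bounded_delay_events` is hypothetical.

```python
import random

def bounded_delay_events(num_rounds: int, p: float, k: int,
                         rng: random.Random) -> list:
    """Rounds (0-based) in which one process has an event.

    Assumes the uniform adversary: one pseudo-event per round, each with
    delay probability p and delay k.
    """
    r = 0
    # Init: with probability p(k-1) / (1 + p(k-1)) eliminate the first m
    # pseudo-events, m uniform in 1 .. k-1.
    if k > 1:
        p_init = p * (k - 1) / (1 + p * (k - 1))
        if rng.random() < p_init:
            r = rng.randint(1, k - 1)
    events = []
    # Iter: every surviving pseudo-event stays; a "delayed" toss eliminates
    # the k - 1 pseudo-events that immediately follow it.
    while r < num_rounds:
        events.append(r)
        r += k if rng.random() < p else 1
    return events
```

Consecutive events are therefore either 1 or $k$ rounds apart, and with `p = 0.0` the process has an event in every round.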

As  before,  the  probability  of  a  computation  C  is  the  probability  that  it  arises  in  the 
above  elimination  game.  The  nontermination  probability  and  the  expected  length  of  a 
computation  are  defined  as  before.  We  require  the  adversaries  to  be  conforming,  that  is,  for 
any  round,  r,  and  for  any  process,  id,  we  require  that  the  probability  that  process  id  starts 
a  delay  at  round  r  is  at  most  p. 

The  bounded  delays  model  is  not  as  robust  as  the  unbounded  delays  model.  The  first 
property,  monotonicity,  holds,  but  the  other  two  do  not.  Therefore,  we  must  analyze  the 
algorithms using events and rounds rather than superevents and superrounds. The
complexity bounds we obtain here are not obviously tight, so we provide lower bounds on the
performance  of  the  algorithms  as  well. 

4.1      Summation  Algorithm 

Now, we consider the summation algorithm under the bounded delays measure with
parameters $p$ and $k$. We start by analyzing the synchronous variation.

Consider the computation of time step $i$. If none of the operations forming time step
$i$ are delayed, time step $i$ takes one round; otherwise, it takes $k$ rounds. Therefore, the
number of rounds to complete a step with $m$ processes has the following distribution:

$$ \text{rounds} = \begin{cases} 1 & \text{with probability } (1 - p)^m \\ k & \text{with probability } 1 - (1 - p)^m \end{cases} $$

The expected number of rounds it takes to complete a step involving $m$ nodes is $E_m =
k - (k - 1)(1 - p)^m$. Each level of the computation comprises some $r$ events, $r$ a constant.
The expected number of rounds to complete the whole computation is:

$$\begin{aligned}
E &= r \sum_{i=0}^{\log n - 1} E_{2^i} \\
&= r \sum_{i=0}^{\log n - 1} \bigl[ k - (k-1)(1-p)^{2^i} \bigr] \\
&= rk \log n - r(k-1) \Biggl[ \sum_{i=0}^{\log\frac{1}{p} - 1} (1-p)^{2^i} + \sum_{i=\log\frac{1}{p}}^{\log n - 1} (1-p)^{2^i} \Biggr] \\
&\ge rk \log n - r(k-1) \Biggl[ \log\frac{1}{p} + \sum_{i=0}^{\log n - 1 - \log\frac{1}{p}} e^{-2^i} \Biggr] \\
&\ge rk \Bigl( \log n - \log\frac{1}{p} \Bigr) + r \log\frac{1}{p} - \frac{r(k-1)}{e-1} \\
&= \Omega\bigl( k(1-\epsilon) \log n \bigr) \qquad \text{when } p = n^{-\epsilon} \text{ for } \epsilon < 1.
\end{aligned}$$

When $\epsilon < \frac{1}{2}$, it is easy to check that the underestimate is only by a constant factor. Thus,
if $p > 1/n^{1/2}$ the expected rounds complexity for the synchronous variation is $\Theta(k \log n)$.
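The synchronous cost can be evaluated directly from $E_m = k - (k-1)(1-p)^m$. The sketch below sums it over the levels of a tree with $n$ leaves ($n$ a power of two) and compares it with the closed-form lower bound $rk(\log n - \log\frac{1}{p}) + r\log\frac{1}{p} - \frac{r(k-1)}{e-1}$. Function names are hypothetical, and the lower bound is evaluated only for $p$ an inverse power of two so that $\log\frac{1}{p}$ is an integer.

```python
import math

def sync_expected_rounds(n: int, p: float, k: int, r: int = 1) -> float:
    """Expected rounds of the synchronous variation, n a power of two.

    Sums E_m = k - (k - 1) * (1 - p)**m over steps with m = 2**i nodes,
    i = 0 .. log2(n) - 1, each level comprising r events.
    """
    levels = n.bit_length() - 1  # log2(n) for a power of two
    return r * sum(k - (k - 1) * (1 - p) ** (2 ** i) for i in range(levels))

def sync_lower_bound(n: int, p: float, k: int, r: int = 1) -> float:
    """Closed-form lower bound on sync_expected_rounds (p = 2**-j assumed)."""
    log_n = n.bit_length() - 1
    log_inv_p = math.log2(1.0 / p)
    return (r * k * (log_n - log_inv_p) + r * log_inv_p
            - r * (k - 1) / (math.e - 1))
```

For instance, with $n = 1024$, $p = 1/32$, $k = 5$ and $r = 2$ the exact sum is a little above 62 rounds, against 20 rounds with no delays and 100 rounds when every step is delayed.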

Now,  consider  the  asynchronous  case.  An  argument  identical  to  that  given  for  the 
unbounded  delays  scenario  shows  that  it  suffices  to  determine  the  complexity  of  computing 
$n$ paths of length $r \log n$, where there is a separate process for each node; each process
performs events with the goal of validating its node, $v$; if $v$'s child is valid, then $v$ can be
validated; otherwise its process executes a null statement. The analysis is complicated by
the  fact  that  when  a  node  is  validated  its  parent's  process  may  be  in  the  middle  of  a  long 
delay.  It  is  convenient,  where  no  confusion  will  result,  to  refer  to  a  node  and  its  process 
interchangeably. 

Consider  all  the  nodes  on  a  path  P.  We  call  a  node  slow  if  its  process  is  in  the  middle 
of  a  delay  when  its  child  is  validated.  Let  q  be  the  probability  that  a  node  is  slow.  Then 
q  <  pk. 

The algorithm has $\log n$ levels of superevents, each comprising $r$ events. For any two
numbers, $a$ and $l$, $a > 1$ and $l > 1$, the probability, $P(a, l)$, that a path has at least $al$ slow
nodes is

$$ P(a, l) \le \binom{r \log n}{al} q^{al}. $$

The probability, $P_m(a, l)$, that there is a path with at least $al$ slow nodes is then at most

$$ P_m(a, l) \le n \binom{r \log n}{al} q^{al}. $$

Thus, whenever $q < n^{-(r+1)/l}$, $P_m(a, l) < n^{(r+1)(1-a)}$.

Theorem 4.1 For any two numbers, $l$ and $a$, $l, a > 1$, if $pk < n^{-(r+1)/l}$, the probability that
the length of the computation exceeds $r \log n + kal$ is at most $1/n^{(r+1)(a-1)}$.

Proof. Suppose every path has at most $al$ slow nodes. Then the computation completes
in at most $r \log n + k \cdot al$ rounds. $\Box$

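Corollary 4.1 For any number, $l$, $l > 1$, if $pk < n^{-(r+1)/l}$, the expected length of a
computation of the summation algorithm is at most $r \log n + 2kl + O(kl/n^{r+1})$.

Proof. The expected length of a computation, $E_p(A)$, is

$$\begin{aligned}
E_p(A) = \sum_{j \ge 0} \mathrm{NT}_p(A, j) &\le r \log n + kl + \sum_{j \ge 0} \mathrm{NT}_p(A, r \log n + kl + j) \\
&\le r \log n + kl + \sum_{j \ge 0} kl \cdot \mathrm{NT}_p(A, r \log n + kl(1 + j)) \\
&\le r \log n + kl \Bigl( 1 + \sum_{j \ge 0} 1/n^{(r+1)j} \Bigr) \\
&\le r \log n + 2kl + O\bigl( kl/n^{r+1} \bigr).
\end{aligned}$$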


Lower Bounds. Now, we consider the lower bound on the complexity of the parallel
summation algorithm under the bounded delays model. Consider a single path $P$, from a
leaf to the root. We call a node $v$ on the path slow if when $v$'s child $w$ is validated, it takes
at least $\lfloor (k-1)/2 \rfloor + 1$ rounds before $v$ completes the execution of its next event. For any
number, $a < 1$, and for any node, $e$, we say that $e$ is $a$-late if there is a path $P$ from a leaf
to $e$ which contains at least $a \cdot \mathit{height}(e)$ slow nodes, where $\mathit{height}(e)$ denotes the height of
node $e$ in the tree.

Consider the uniform adversary, $A$: each process is given one pseudo-event in each round
and to each pseudo-event $A$ associates a probability $p$ of elimination and a delay $k$. We show
that for some constant, $a$, the root of the tree is $a$-late with high probability. We start with
some preliminary results.

For any round, $r$, and for any process, $l$, let $Q_l(r)$ be the probability that process $l$ has
an event at round $r$. We show:

Lemma 4.1 For any round, $r$, and for any process, $l$, the probability, $Q_l(r)$, that process $l$
has an event in round $r$ is

$$ Q_l(r) = \frac{1}{1 + p(k - 1)}. $$

Proof. Consider a process, $l$. The proof is by induction on the round number, $r$. We start
with the induction step: If $r \ge k$ then $l$ has an event in round $r$ in one of two cases.

1. If $l$ has an event in round $r - 1$ then $l$ has an event in round $r$ if and only if it tossed
"normal" at round $r - 1$.

2. If $l$ does not have an event in round $r - 1$ then $l$ has an event in round $r$ if and only
if process $l$ has an event at round $r - k$ and it tossed "delayed" at that round.

In other words, if $r \ge k$ then

$$ Q_l(r) = p \cdot Q_l(r - k) + (1 - p) \cdot Q_l(r - 1). $$

The proof is concluded by showing that the lemma holds for all $r$, $0 \le r \le k - 1$.

Assume that $r \le k - 1$. Then the only way in which process $l$ can have an event in
round $r$ is if all the pseudo-events of process $l$ with round number less than $r$ which were
not eliminated by the Init step tossed "normal." The probability that no pseudo-events were
eliminated in the Init step is $\frac{1}{1 + p(k-1)}$ by design. If the Init step eliminates pseudo-events,
then for any number, $i$, $1 \le i \le k - 1$, the Init step eliminates exactly $i$ pseudo-events with
probability $\frac{p}{1 + p(k-1)}$. Therefore, if $r \le k - 1$

$$ Q_l(r) = \frac{1}{1 + p(k-1)} (1 - p)^r + \sum_{i=1}^{r} \frac{p}{1 + p(k-1)} (1 - p)^{r - i} = \frac{1}{1 + p(k-1)}. $$

This concludes the proof of the lemma. $\Box$
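The lemma can also be confirmed by exact computation. The sketch below tracks, for one process under the uniform adversary, the distribution of the number of its upcoming pseudo-events that are already eliminated, starting from the Init distribution and applying the Iter rule round by round. Exact rational arithmetic is used so the equality can be tested without rounding error; the function name is hypothetical.

```python
from fractions import Fraction

def event_probabilities(p: Fraction, k: int, rounds: int) -> list:
    """Exact Pr[the process has an event in round r] for r = 0 .. rounds-1.

    dist[j] is the probability that the next j pseudo-events of the process
    are eliminated (j = 0 means the process has an event this round).
    """
    denom = 1 + p * (k - 1)
    # Init: no elimination w.p. 1/denom; exactly m eliminated w.p. p/denom each.
    dist = [Fraction(1) / denom] + [p / denom] * (k - 1)
    result = []
    for _ in range(rounds):
        result.append(dist[0])
        nxt = [Fraction(0)] * k
        nxt[0] += dist[0] * (1 - p)       # event, coin shows normal
        if k > 1:
            nxt[k - 1] += dist[0] * p     # event, coin shows delayed
        else:
            nxt[0] += dist[0] * p         # k = 1: a "delay" eliminates nothing
        for j in range(1, k):
            nxt[j - 1] += dist[j]         # one eliminated pseudo-event passes
        dist = nxt
    return result
```

For $p = \frac{1}{5}$ and $k = 4$ every round gives exactly $\frac{5}{8} = \frac{1}{1 + p(k-1)}$, which shows why the Init step is needed: it starts each process in the stationary distribution of the delay process.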

Let $q_s$ be the probability that a node in the tree is slow. Then

Lemma 4.2 For any node, $e$, in the summation tree, the probability, $q_s$, that $e$ is slow is

$$ q_s = \frac{p \lceil (k-1)/2 \rceil}{1 + p(k-1)} \ge \frac{1}{2} \cdot \frac{p(k-1)}{1 + p(k-1)}. $$

Proof. It follows from Lemma 4.1 that the probability that a process did not have an event
in the round after its child was validated is $\frac{p(k-1)}{1 + p(k-1)}$. Given that the process was waiting,
following the validation of its child it is equally likely to wait any number of rounds between
1 and $k - 1$. Therefore, the probability that it had to wait at least $\lfloor (k-1)/2 \rfloor$ rounds is

$$ \frac{p \lceil (k-1)/2 \rceil}{1 + p(k-1)}, $$

as desired. $\Box$
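As a quick check of the lemma, the sketch below evaluates the slow probability in the form $q_s = \frac{p\lceil (k-1)/2 \rceil}{1 + p(k-1)}$ and compares it with the lower bound $\frac{p(k-1)}{2(1 + p(k-1))}$. The formula is taken as an assumption about the intended statement, and the function name is hypothetical.

```python
from fractions import Fraction

def slow_probability(p: Fraction, k: int) -> Fraction:
    """q_s = p * ceil((k - 1) / 2) / (1 + p * (k - 1)), in exact arithmetic."""
    half_up = k // 2  # equals ceil((k - 1) / 2) for every integer k >= 1
    return p * half_up / (1 + p * (k - 1))
```

With $p = \frac{1}{5}$ and $k = 4$ this gives $q_s = \frac{1}{4}$, and the lower bound holds with equality whenever $k$ is odd.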

Let $m = \frac{1}{2} r \log n$ and consider all the nodes of the tree at height $m$. We compute the
probability that at least one of these nodes is $a$-late, for some number $a$. Note that there
are $\sqrt{n}$ nodes at height $m$ so there are $\sqrt{n}$ node disjoint paths from the leaves to nodes at
height $m$. Consider one such path, $P$, and let $S$ be the number of slow nodes on $P$. We
use the Chernoff bounds [HR89]: for any number, $\epsilon$, $0 < \epsilon < 1$, the probability that $P$ has
fewer than $(1 - \epsilon) m q_s$ slow nodes is

$$ \Pr[S < (1 - \epsilon) m q_s] \le e^{-\frac{1}{2} \epsilon^2 m q_s}. $$

When $\epsilon = \frac{1}{2}$,

$$ \Pr\Bigl[ S < \tfrac{1}{2} m q_s \Bigr] \le e^{-\frac{1}{8} m q_s} \le e^{-\frac{1}{16} m \frac{p(k-1)}{1 + p(k-1)}}. $$

It then follows that

$$ \Pr\Bigl[ S < \frac{m}{4} \cdot \frac{p(k-1)}{1 + p(k-1)} \Bigr] \le \Pr\Bigl[ S < \tfrac{1}{2} m q_s \Bigr] \le e^{-\frac{1}{16} m \frac{p(k-1)}{1 + p(k-1)}}. $$

Let $c$ be any number, $c < 1$, and assume that $p(k - 1) > \frac{c}{1-c}$. Then $\frac{p(k-1)}{1 + p(k-1)} > c$. It follows
that

$$ \Pr\Bigl[ S < \frac{cm}{4} \Bigr] \le \Pr\Bigl[ S < \frac{m\,p(k-1)}{4(1 + p(k-1))} \Bigr] \le e^{-\frac{cr}{32} \cdot \frac{\ln n}{\ln 2}} = n^{-\frac{cr}{32 \ln 2}}. $$

In other words, for each node, $e$, at level $m$, the probability that $e$ is not $\frac{c}{4}$-late is at most
$n^{-\frac{cr}{32 \ln 2}}$. The tree has $\sqrt{n}$ nodes at level $m$. Thus, the probability that none of them is
$\frac{c}{4}$-late is at most $n^{-\frac{cr\sqrt{n}}{32 \ln 2}}$.

Thus, for some path, its computation requires at least $r \log n + \frac{cm(k-1)}{8} = r \log n (1 +
\frac{c}{16}(k - 1))$ rounds with probability at least $1 - 1/n^{cr\sqrt{n}/(32 \ln 2)}$. We have shown:

Theorem 4.2 For any number, $c < 1$, if $p(k - 1) > \frac{c}{1-c}$, the summation algorithm requires
at least $r \log n (1 + \frac{c}{16}(k - 1))$ rounds with probability at least $1 - 1/n^{cr\sqrt{n}/(32 \ln 2)}$.

In particular, if the number $c$ is a constant, we get:

Corollary 4.2 For any constant, $c$, $c < 1$, if $p(k - 1) > \frac{c}{1-c}$, the expected complexity of the
summation algorithm is $\Omega(kr \log n)$ rounds.

4.2      Recursive  Doubling 

Now, we consider the recursive doubling algorithm under the bounded delays measure with
parameters $p$ and $k$. Under the synchronous variation the algorithm proceeds like the
summation algorithm in $r \log n$ levels, for some constant $r$. Therefore, when $p > 1/n^{1/2}$ the
expected rounds complexity for the synchronous variation is $\Omega(rk \log n)$.
The  main  result  of  this  section  is  stated  in  the  following  theorem: 

Theorem  4.3  If  p  <   1/3^'^  and  pk  <   1/18,  then  for  any  number,  c  >  2,   the  algorithm 
terminates in $O(\log n + k(\log c + \log\log n))$ rounds with probability at least $1 - 1/n^{2c-1}$.

Corollary  4.3  If  p  <  1/3^"^  and  pk  <  1/18,  the  expected  number  of  rounds  the  algorithm 
executes is $O(\log n + k \log\log n)$.

The  analysis  of  the  recursive  doubling  algorithm  under  the  bounded  delay  model  is 
similar  to  the  analysis  given  for  the  unbounded  delays  model,  but  more  involved.  The 
mapping  of  the  computation  to  an  execution  tree  is  the  key  to  making  the  analysis  tractable. 
This  allows  complicated  cases  due  to  slow  processes  to  be  circumvented  by  pruning  the  tree; 
this  overestimates  the  number  of  rounds  required,  but  only  by  constant  factors. 

We  start,  as  before,  with  the  computation  DAG.  The  long  delays  appear  in  the  graph  as 
long  vertical  edges.  We  underestimate  the  progress  made  by  a  node  by  ignoring  the  progress 
it  made  before  a  long  delay.  We  achieve  this  by  trimming  the  DAG,  eliminating  all  the  long 
vertical edges and all cross edges whose head falls on a long vertical edge.

We define the execution tree as before; however, the tree may contain nodes with no
children. A node will have no children if after a long delay it executed a doubling operation,
jumping over a node which was in the middle of a long delay; this happens with probability
$q_0 < p^2 k$. A node will have 1 child in one of two cases: If after a long delay it jumped over a
"normal" node or if it is a normal node which jumped over a node which is in the middle of a
long delay. Each of these cases happens with probability at most $pk$; so, the probability, $q_1$,
that a node has 1 child is at most $2pk$. Otherwise, the node has two children; this happens
with probability $q_2 = 1 - (q_0 + q_1)$.

We analyze the algorithm using the productive probability as before. However, we have
to  augment  the  argument  to  take  into  account  nodes  which  have  no  children.  Suppose  the 
execution  tree  has  a  level  with  M  nodes,  M  to  be  chosen  later.  When  M  is  sufficiently  large 
the  probability  that  the  level  is  productive  (i.e.,  the  probability  that  the  next  level  down 
has  a  constant  factor  more  nodes  than  the  current  level)  is  very  large.  We  underestimate 
the  termination  probability  by  computing  the  probability  that  all  sufficiently  large  levels  of 
the  tree  are  productive. 

For any number, $l$, we call a level with $l$ nodes nonproductive if either it has at least $l/5$
nodes with no children or it has at least $l/2$ nodes with 1 child. It follows that if a level is
productive it has at least $\frac{11}{10} l$ children.
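The constant can be verified in one line. The derivation below assumes the two thresholds are $l/5$ (nodes with no children) and $l/2$ (nodes with one child); $a$ and $b$ are names introduced here for the two counts.

```latex
% Assumed reading: in a productive level with l nodes, a < l/5 nodes have no
% children and b < l/2 nodes have exactly one child; the remaining
% l - a - b nodes have two children each.
\begin{align*}
\#\text{children} &\ge b + 2(l - a - b) = 2l - 2a - b \\
                  &> 2l - \tfrac{2l}{5} - \tfrac{l}{2} = \tfrac{11}{10}\, l .
\end{align*}
```

This is where the growth factor $\beta = 11/10$ used in the lemmas that follow comes from.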

For any number, $M$, if $q_0 < 1/3^5$ then the probability that a level with at least $M$ nodes
has at least $M/5$ nodes with no children is at most $(2/3)^M$. Similarly, if $q_1 < 1/9$ then the
probability that a level with at least $M$ nodes has at least $M/2$ nodes with 1 child is at
most $(2/3)^M$. Thus, the probability that a level with at least $M$ nodes is not productive is
at most $2(\frac{2}{3})^M$. Setting $M = 2\log_{3/2} n$, we get:


Lemma 4.3 Given $p < 2/27$ and $pk < 1/18$, for $\beta = 11/10$ and for $t = \lceil \log_\beta n \rceil$, if
$|C^t| > 2\log_{3/2} n$ then $\Pr[|C^0| < n] < \frac{2}{n^2} + O(\frac{1}{n^{2\beta}})$.

Proof. Assume that $C^t$ has at least $2\log_{3/2} n$ nodes. $q_0 < p^2 k < 1/3^5$. Also, $q_1 < 2pk <
1/9$. Starting from level $t$ down, if the first $l$ levels are productive then level $t - l$ has at least
$\beta^l \cdot 2\log_{3/2} n$ elements. Therefore, the probability that some level of the tree, below level $t$,
is non-productive is at most

$$ \Pr < \sum_{l=0}^{t} 2 \left(\tfrac{2}{3}\right)^{2\beta^l \log_{3/2} n} = \sum_{l=0}^{t} 2 n^{-2\beta^l} = \frac{2}{n^2} + O(n^{-2\beta}). $$

If all the levels are productive, the tree has at least $n$ leaves. $\Box$

The analysis is concluded by showing that some level of the tree has at least $M =
2\log_{3/2} n$ nodes. Consider the computation as if it is proceeding in phases, each comprising
$k$ rounds and consider the untrimmed computation DAG and its associated execution tree.
Each process is guaranteed to execute at least one operation in each phase. Therefore, each
node at level $r + k$ has at least 2 descendants at level $r$. Let $f(n) = \log(cM)$, for some $c$,
$c \ge 2$. It follows that for any element, $x$, and any level, $t$, the point $(x, t + k \cdot f(n))$ has at
least $cM$ descendants at level $t$. We selectively trim the computation DAG eliminating all
long edges which cross a level, $t'$, $t' < t$. This may eliminate some nodes from level $t$ which
were descendants of $(x, t + k \cdot f(n))$.

Each node of the execution tree at level $t$ will be eliminated with probability $q_1$. If we
choose $c = 2$, the probability that at least 1/2 the nodes at level $t$ are trimmed is at most
$(2/3)^{2\log_{3/2} n} = 1/n^2$. Therefore, for any element, $x$, any round, $t$, and for $M = 2\log_{3/2} n$, the
probability that the tree rooted at $(x, t + 2M)$ does not have $M$ nodes at level $t$ is at most
$1/n^2$. We have shown:

Lemma 4.4 For $\beta = 11/10$, for any element, $x$, and for any number, $t$, $t > 2\lceil \log_\beta n \rceil$, the
probability that after $t + k(1 + \log\log_{3/2} n)$ rounds $x$ has fewer than $n$ leaf descendants is at
most $\frac{3}{n^2} + O(1/n^{2\beta})$.

Proof of Theorem 4.3. Consider the root of the untrimmed computation DAG. For any
point, $w$, of the computation DAG, and any level, $t$, let $\mathit{Desc}(w, t)$ be the set comprising the
level $t$ descendants of $w$. By the above arguments, for any level, $t$, for any number, $m > 2$,
and for any node, $x$, $|\mathit{Desc}((x, t + k \log m), t)| \ge m$.

Applying Lemma 4.4 to each member, $y$, of $\mathit{Desc}((x, t + k \log m), t)$ we get that for some
constant, $h$, if $t > h(\log n + k \log\log n)$ the probability that $y$ does not have $n$ leaf descendants
is at most $p_1 = 3/n^2 + O(1/n^{2\beta})$. The number of leaf descendants of $(x, t + k \log m)$ is the
sum over all members, $y$, of $\mathit{Desc}((x, t + k \log m), t)$ of the number of leaf descendants of
$y$. In particular, if any such $y$ has $n$ leaf descendants then $(x, t + k \log m)$ has at least $n$
leaf descendants. It follows that for some constant, $h_1$, if $x$ is at level $h_1(\log n + k(\log m +
\log\log n))$ the probability that $x$ does not have at least $n$ leaf descendants is at most $(p_1)^m$.
The input list has $n$ elements, therefore, for any number, $m$, $m > 2$, the recursive doubling
algorithm terminates in $h_1(\log n + k(\log m + \log\log n))$ rounds with probability at least
$1 - O(n \cdot 1/n^{2m}) = 1 - O(1/n^{2m-1})$. $\Box$

Lower  Bounds.  Next,  we  show  that  this  bound  is  tight  by  showing  that  for  any  constant, 
a  >  1,  if  p  >  (logn)""  then  for  some  constant,  h',  the  computation  has  length  at  least 
r  •  max{/i'/:loglogn,logn}  rounds,  with  high  probability. 

Consider the uniform adversary, A: each process is given one pseudo-event in each round
and to each event A associates elimination probability p and delay k. Consider any random
computation, C. For any process, P, and any event, e, of process P, the distance between
P and its parent after round t is at most 2^t; thus, |C| > r log n.
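
The doubling bound used above can be seen in a small simulation. The sketch below is not the APRAM algorithm analyzed in this paper: it assumes a fully synchronous execution with no slow events, and the name `pointer_jumping` is ours, introduced only for illustration. It records the largest pointer distance after each round, which never exceeds 2^t after round t.

```python
def pointer_jumping(n):
    """Synchronous recursive doubling on a list of n elements (illustrative only).

    nxt[i] is the element that i currently points to; the last element
    (index n - 1) points to itself. Returns a list whose entry t is the
    largest pointer distance after t rounds (entry 0 is the initial state).
    """
    nxt = [min(i + 1, n - 1) for i in range(n)]
    max_jump = [max(nxt[i] - i for i in range(n))]
    while any(p != n - 1 for p in nxt):
        # All elements jump at once: read the old array, build a new one.
        nxt = [nxt[nxt[i]] for i in range(n)]
        max_jump.append(max(nxt[i] - i for i in range(n)))
    return max_jump


jumps = pointer_jumping(6)
rounds = len(jumps) - 1  # number of rounds until every pointer reaches the end
```

Since the distance at most doubles per round, the number of rounds cannot fall below the logarithm of the list length, even before any delays are introduced.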

For each event, e, of C, we say that e is slow if the corresponding pseudo-event tossed
slow (and as a consequence the next k - 1 pseudo-events of the same process were deleted
from the sequence of pseudo-events).
Let s = 3 log n/(log log n)^2 and consider the list as if it is divided into n/s segments of
s contiguous items. Let μ be a constant to be specified later. We call a segment superslow
if the first μ log log n events of each of the processes associated with the segment were slow.
If after kμ log log n rounds the first element, x, in a superslow segment, S, points to an
element, y, inside S, y is at distance exactly 2^{μ log log n} = log^μ n from x. If // < i it is
straightforward to show that for any n > 4, log^μ n < s. Consequently, the first element of
S points to an element strictly inside S and the algorithm could not have terminated.

Consider a segment, S, of length s. Let P_s be the probability that S is superslow. Then

\[ P_s = p^{s \mu \log\log n} \ge (\log n)^{-3a\mu \log n/\log\log n} = n^{-3a\mu}. \]

There are n/s = n(log log n)^2/(3 log n) such segments. Let Q_s be the probability that at
least one such segment is superslow. Then the probability that no segment is superslow,
1 - Q_s, is

\begin{align*}
1 - Q_s &= (1 - P_s)^{n/s} \\
        &\le \left(1 - n^{-3a\mu}\right)^{n(\log\log n)^2/(3\log n)} \\
        &= \left(\left(1 - n^{-3a\mu}\right)^{n^{3a\mu}}\right)^{n^{1-3a\mu}(\log\log n)^2/(3\log n)}.
\end{align*}

Assuming 3aμ < 1/2, then for n > 16

\[ 1 - Q_s \le \left(1 - n^{-3a\mu}\right)^{n^{3a\mu}} < \frac{1}{e}. \]
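
The chain of inequalities above can be checked numerically for sample parameters. The sketch below is only an illustration: it assumes base-2 logarithms, takes p at its smallest admissible value (log n)^{-a}, and uses a hypothetical function name `no_superslow_bound`. The values a = 1.5 and μ = 0.1 are arbitrary choices, not values from the analysis.

```python
import math


def no_superslow_bound(n, a, mu):
    """Return (P_s, (1 - P_s)^(n/s)) for p = (log n)^(-a), logs base 2.

    P_s is the probability that one segment of length s is superslow;
    the second value bounds the probability that no segment is superslow.
    """
    log_n = math.log2(n)
    loglog_n = math.log2(log_n)
    s = 3 * log_n / loglog_n ** 2        # segment length
    p = log_n ** (-a)                    # assumed slow probability
    p_s = p ** (s * mu * loglog_n)       # all s * mu * loglog(n) events slow
    return p_s, (1 - p_s) ** (n / s)


p_s, bound = no_superslow_bound(2 ** 20, a=1.5, mu=0.1)
```

With these sample values P_s coincides with n^{-3aμ} up to rounding, and the returned bound lies well below 1/e.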

We  have  shown: 

Theorem  4.4  For  any  number,  a,  a  >  I,  for  any  /^  <  5  such  that  afi  <  |,  and  for  any 
n > 16, if p > (log n)^{-a} the recursive doubling algorithm requires at least kμ log log n rounds
with probability at least 1 - 1/e.

Corollary  4.4  For  any  number,  a,  a  >  1,  for  any  number,  fi,  fi  <  ^  such  that  afi  <  |, 
and  for  n>  16  ifp>  (logn)"''  then  Ep{A)  >  \{\ogn^{l-\)kn\og\ogn)  >  ^■^^{\ogn  + 
Arloglogn)  rounds. 

Theorem 4.5 For any constant, c > 1, if 1/(log n)^c < p < 1/3^-^ and pk < 1/18 then
E_p(A) = O(log n + k log log n) rounds.

Summary of the Results Under the Bounded Delays Model. Under the bounded
delays model, the list-based recursive doubling algorithm performs better than the tree-based
summation algorithm under certain conditions.
We have shown that for any constant, c, if 1/(log n)^c < p < 1/3^-^ and pk < 1/18
then E_p(RD) = O(log n + k log log n) rounds, where RD stands for the recursive doubling
algorithm. On the other hand, we have shown that if pk is a constant, pk < 1, the expected
rounds complexity of the summation algorithm is E_p(Sum) = O(r k log n).

For example, when k = log n/log log n and when p = l/k for some constant l, E_p(RD) =
O(log n) while E_p(Sum) = O(log^2 n/log log n).
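
To get a feel for the gap in this example, the two growth rates can be evaluated directly. The sketch below assumes base-2 logarithms and sets every constant hidden by the asymptotic notation to 1, so it compares only the shapes of the bounds and not actual running times. The function name `round_bounds` is hypothetical.

```python
import math


def round_bounds(n):
    """Evaluate both growth rates for k = log n / log log n (constants set to 1)."""
    log_n = math.log2(n)
    loglog_n = math.log2(log_n)
    k = log_n / loglog_n
    recursive_doubling = log_n + k * loglog_n   # equals 2 log n for this k
    summation = k * log_n                       # equals log^2 n / log log n
    return recursive_doubling, summation


rd, sm = round_bounds(2 ** 16)
```

For n = 2^16 this gives 32 against 64, and the ratio between the two grows like log n/(2 log log n) as n increases.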